
KEP-2857: Runtime Assisted Mounting of Persistent Volumes #2893

Closed
wants to merge 1 commit

Conversation

ddebroy
Member

@ddebroy ddebroy commented Aug 24, 2021

  • One-line PR description: Provisional KEP for Runtime Assisted Mounting of Persistent Volumes
  • Other comments: Initial KEP PR

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Aug 24, 2021
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Aug 24, 2021
@ddebroy ddebroy requested review from pohly and jsafrane August 24, 2021 06:06
@ddebroy ddebroy changed the title KEP-2857: Provisional KEP for Runtime Assisted Mounting of Persistent Volumes [WIP] KEP-2857: Runtime Assisted Mounting of Persistent Volumes [WIP] Aug 24, 2021
@ddebroy ddebroy force-pushed the runtime1 branch 2 times, most recently from 986ace7 to e4721ee Compare August 24, 2021 06:52
on a persistent volume, it is not expected to support any deferred file system
management operations since the runtime will not own a file system mount to manage.

A new CSI service: `Runtime` is proposed to coordinate the file system management
Contributor

Do you mean gRPC service here? Runtime is going to be implemented by a container runtime, not a CSI driver, right?

Member Author

Correct - a gRPC service that will be implemented by a container runtime and not the CSI Node Plugin. Will clarify that.
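
For illustration, a minimal Go sketch of the interface such a runtime-implemented gRPC service could expose. The RPC names follow the crust_ops calls that appear later in this thread; the request/response types are empty placeholders, not the KEP's actual protobuf definitions.

```go
// Rough sketch only: the server-side surface a container runtime handler could
// implement for the proposed service. Request/response shapes are placeholders.
package runtimefs

import "context"

type (
	RuntimeStageVolumeRequest    struct{} // would carry volume ID, device path, fs type, mount flags
	RuntimeStageVolumeResponse   struct{}
	RuntimeUnstageVolumeRequest  struct{} // would carry volume ID
	RuntimeUnstageVolumeResponse struct{}
	RuntimeExpandVolumeRequest   struct{} // would carry volume ID and requested size
	RuntimeExpandVolumeResponse  struct{}
)

// RuntimeFSService is implemented by the container runtime handler (or an
// intermediary such as crust_ops), not by the CSI node plugin.
type RuntimeFSService interface {
	RuntimeStageVolume(ctx context.Context, req *RuntimeStageVolumeRequest) (*RuntimeStageVolumeResponse, error)
	RuntimeUnstageVolume(ctx context.Context, req *RuntimeUnstageVolumeRequest) (*RuntimeUnstageVolumeResponse, error)
	RuntimeExpandVolume(ctx context.Context, req *RuntimeExpandVolumeRequest) (*RuntimeExpandVolumeResponse, error)
}
```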

should be approved by the remaining approvers and/or the owning SIG (or
SIG Architecture for cross-cutting KEPs).
-->
# KEP-2857: Runtime Assisted Mounting of Persistent Volumes
Contributor

"generic ephemeral volumes" also would be handled this way. It would be better to keep "persistent" out of the KEP name.

As this is designed to work only with volumes provided by a CSI driver, perhaps replace "Persistent" with "CSI"?

volumes to the container runtime is a basic requirement in the context
of this KEP. The enhancements necessary to support this are:
- A new field in RuntimeClass to indicate a handler/runtime can support mount
deferral.
Contributor

It's worth calling out here early that NodeStageVolume cannot be used when a CSI driver wants to support deferred mounts.

stage the volume (ensuring it is formatted) and mount the filesystem
(associated with the PV) in the node host OS environment. This stage happens
irrespective of the runtime's capabilities indicated in the subsequent CSI
NodePublishVolume call.
Contributor

What's the advantage of putting this work into NodeStageVolume? Why not skip the call and do the work in NodePublishVolume?

Member Author

I will put a note clarifying ^ is the preferred approach.

// SupportsFSMounts indicates whether the Handler supports mounting of
// FileSystem for persistent volumes referred by the pod
// Default False
SupportsFSMounts bool `json:"supportsFSMounts,omitempty" protobuf:"bytes,5,opt,name=supportsFSMounts"`
Contributor

API reviewers will know this better, but I suspect this might have to be a pointer to handle cases where the client doesn't support the field (set vs. not set).
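
A minimal sketch of the pointer variant suggested above; with `*bool`, nil distinguishes "not set" from an explicit false. The struct and tags are illustrative, not the final API.

```go
// Sketch only: the proposed field declared as a pointer so API machinery can
// tell "not set" (nil) apart from an explicit false.
package v1alpha1

type RuntimeClassMountFields struct {
	// SupportsFSMounts indicates whether the handler supports mounting of
	// filesystems for persistent volumes referred to by the pod.
	// nil means the client did not set the field, so defaulting can apply.
	// +optional
	SupportsFSMounts *bool `json:"supportsFSMounts,omitempty"`
}
```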

// by the Handler over which FileSystem Management APIs can be invoked. Should
// not be specified if Handler does not support FileSystem Management APIs.
// +optional
FSMgmtSocket *string `json:"fsMgmtSocket,omitempty" protobuf:"bytes,6,opt,name=fsMgmtSocket"`
Contributor

This assumes that the path is the same for all nodes in the cluster, doesn't it? Can that be guaranteed?

Member Author
@ddebroy ddebroy Aug 24, 2021

Good point. I will add a recommendation to use separate RuntimeClass specs with Scheduling (https://kubernetes.io/docs/concepts/containers/runtime-class/#scheduling) to target different classes of nodes that run the same runtime but use different domain socket paths, and to specify the corresponding domain socket path in each RuntimeClass.
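
A sketch of that recommendation, assuming the alpha "annotations on RuntimeClass" approach mentioned below: two RuntimeClass objects for the same handler, each steered to one class of nodes via Scheduling and carrying that class's socket path. The annotation key, node label, handler name, and socket paths are illustrative only.

```go
// Sketch: one RuntimeClass per node class, same runtime handler, different
// FS-management socket path. Annotation key, node label, and paths are examples.
package runtimeclasses

import (
	nodev1 "k8s.io/api/node/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func runtimeClassFor(name, socketPath, nodeClass string) *nodev1.RuntimeClass {
	return &nodev1.RuntimeClass{
		ObjectMeta: metav1.ObjectMeta{
			Name: name,
			// Hypothetical alpha annotation carrying the FS management socket path.
			Annotations: map[string]string{"example.com/fs-mgmt-socket": socketPath},
		},
		// Both classes point at the same underlying runtime handler.
		Handler: "kata",
		Scheduling: &nodev1.Scheduling{
			// Steer pods using this RuntimeClass to nodes with the matching socket layout.
			NodeSelector: map[string]string{"example.com/socket-layout": nodeClass},
		},
	}
}

var (
	classA = runtimeClassFor("kata-layout-a", "/run/crust/a.sock", "a")
	classB = runtimeClassFor("kata-layout-b", "/run/crust/b.sock", "b")
)
```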

During the Alpha phase, the above fields can be introduced as annotations on
RuntimeClass.
Contributor

Using annotations instead of proper fields has fallen out of favor as far as I know.

// network share when SP defers file system mounts to a container runtime.
// A SP MUST populate this if runtime_supports_mount was set in
// NodePublishVolumeRequest and SP is capable of deferring filesystem mount to
// the container runtime.
Contributor

A new capability needs to be defined. Otherwise kubelet won't know whether the SP checks runtime_supports_mount and sets this field.
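
A tiny sketch of the check this comment asks for, with a hypothetical capability name standing in for whatever the CSI spec change would define: kubelet would only rely on the new response field when the plugin advertises the capability.

```go
// Sketch: gate the deferred-mount path on a capability advertised by the CSI
// node plugin (e.g. via NodeGetCapabilities). The capability name below is
// hypothetical; the real one would come from the CSI spec change.
package capabilitycheck

type nodeCapability string

const deferFSMount nodeCapability = "DEFER_FS_MOUNT" // hypothetical name

func supportsDeferredMount(advertised []nodeCapability) bool {
	for _, c := range advertised {
		if c == deferFSMount {
			return true
		}
	}
	return false
}
```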

Note that the OCI spec already allows the `type` and `options` fields to be
specified as part of the `mount` field. Therefore, no enhancements should be
necessary in the OCI spec. OCI runtimes supporting mount deferral need to be
able to execute the filesystem mount with the details specified in `mounts`.
Contributor

How can this be checked? Do users have to ensure that they don't ask for something that isn't supported, like creating a volume with XFS as filesystem and then using that with a runtime that doesn't support XFS?
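
For context on the quoted paragraph above, a hedged sketch of what a deferred-mount entry in the OCI spec's `mounts` array could look like, built with the runtime-spec Go bindings. The device path, filesystem type, and options are example values only.

```go
// Illustrative only: a mount entry carrying the block device, fs type, and
// options that the runtime (rather than the host) would be asked to mount.
package main

import (
	"encoding/json"
	"fmt"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	m := specs.Mount{
		Destination: "/data",    // mount point inside the container/VM
		Type:        "ext4",     // filesystem type resolved by the CSI plugin
		Source:      "/dev/sdb", // block device published instead of a host mount
		Options:     []string{"rw", "noatime"},
	}
	out, _ := json.MarshalIndent(m, "", "  ")
	fmt.Println(string(out))
}
```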

Comment on lines 428 to 431
- The node CSI plugin receives a CSI `NodePublishVolume` call. In response, the
CSI plugin will unmount the file system (if mounted as part of stage) and pass
the block device path (rather than a file system mount point) along with file-system
type and mount options to the Kubelet.
Member

I think this mounting in NodeStage and unmounting in NodePublish is just a waste of resources. A driver should announce its capability to "provide a block volume for filesystem PV" and kubelet should do the right thing right away.

specifying details of how and what to mount on a block device or network share.
message FileSystemMountInfo {
// Source device or network share (e.g. /dev/sdX, srv:/export) whose mount
Member

The whole point of CSI is to hide all storage details from Kubelet and the host OS into CSI drivers. If we want a micro VM to mount a generic CSI volume, then it needs to support all possible filesystems that any CSI driver may use, for example CephFS, CIFS, GlusterFS and NFS, mentioning just the major open source ones.

  • How does kubelet know what filesystems the container runtime / VM supports? What if the VM does not have the utils to mount GlusterFS, but a pod uses a Gluster PV? Or the kernel module for zfs?

  • Some CSI drivers (say Gluster or CephFS) may need a config file to mount a filesystem or even get credentials from a Secret. If a VM had utils for these filesystems, how does it get the config + secrets?

  • Some proprietary CSI drivers need a kernel module or fuse daemon. Are they supposed to run in the VM?

It is possible to solve all the above, but not without leaking the storage details from the CSI driver either back to kubelet or to the micro VM / its OS / its micro kubelet (or whatever it runs to mount volumes).

Could the micro VM run the CSI driver in the end?

Another possibility would be to limit this proposal to volumes based on block devices. FSType is already quite exposed in Kubernetes API and everyone sort of supports ext4 and xfs. Kubelet could know that a CSI driver supports "block devices for filesystem volumes" capability and it could know that pod's RuntimeClass can mount them (with a list of supported filesystems, say xfs and ext4). Kubelet then could call NodePublish (or even some new CSI call) to get a "block device for a filesystem PV" and give it to CRI to mount. For PVs that use filesystems not supported by the VM (say zfs) or don't support block devices (NFS, CephFS, ...) or don't support the new capability (all legacy CSI drivers in the beginning), virtio-fs would be used as it is today.
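
As an illustration of the fallback behavior proposed in this comment, a small Go sketch of the per-volume decision kubelet would make. The names are illustrative, not actual kubelet code.

```go
// Sketch of the proposed fallback logic: defer the mount to the runtime only
// when the CSI driver can hand out a block device for a filesystem PV and the
// pod's RuntimeClass declares support for that filesystem; otherwise keep
// today's behavior (host mount + virtio-fs).
package mountpolicy

type VolumePlan int

const (
	HostMountVirtioFS    VolumePlan = iota // existing path: CSI driver mounts on the host, runtime uses virtio-fs
	RuntimeDeferredMount                   // new path: runtime mounts the block device inside the VM
)

func planFor(driverSupportsBlockForFS bool, fsType string, runtimeFilesystems map[string]bool) VolumePlan {
	if driverSupportsBlockForFS && runtimeFilesystems[fsType] {
		return RuntimeDeferredMount
	}
	// Legacy drivers, network filesystems (NFS, CephFS, ...), or filesystems the
	// VM cannot mount (e.g. zfs) fall back to the virtio-fs path.
	return HostMountVirtioFS
}
```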

Member Author

Totally agree about the leaking of storage details to cover the broad range of scenarios handled by CSI. I like the suggestion of limiting this proposal to block devices with a list of supported file systems (and explaining that in Limitations/Caveats). I was wondering if you see any major challenges with basic NFS support?

Member

I would be very conservative about adding on-off support for some filesystems, leaving the others unsupported. Even asking drivers to unify on ext4/xfs support is a pretty big requirement. To me it seems we're moving away from CSI ("do whatever you want in your driver, as long as there is a filesystem mounted in the end") to a more strict environment, where a storage backend vendor is forced to implement their plugin in a limited way. We're abandoning all the new fancy storage projects and companies here.

Member Author

This proposal does not advocate that certain plugin capabilities be turned off in any way or that the existing virtiofs path be disabled. It is completely up to the plugin to decide whether it wants to take part in the deferred mount path. If the plugin wants to offer certain capabilities/file systems that a microvm based runtime environment cannot support, it is fine for things to fall back to the regular/existing virtiofs based path when microvm support is desired. I can emphasize in the KEP overview sections that this fall-back path is okay.

CSI plugin will unmount the file system (if mounted as part of stage) and pass
the block device path (rather than a file system mount point) along with file-system
type and mount options to the Kubelet.
- The kubelet passes the block device, file system type and mount options (from
Member

How would it pass potential NodePublish secrets? They're allowed in CSI and used by few CSI drivers.

Member Author
@ddebroy ddebroy Aug 26, 2021

If the secret is necessary for actions prior to the FS mount step, the CSI plugin is expected to consume the secrets and use them. For example, if a block based CSI plugin layers a dm-crypt device on top of the underlying physical block device (using an encryption key passed as a secret) as part of node publish, it is expected to continue to do so. In such scenarios, the CSI plugin is expected to pass the top-level logical block device that the FS mount will target through the new FileSystemMountInfo in NodePublishVolumeResponse - and this will not require access to the secret.

If the secret is necessary for executing the mount, as in the case of SMB/CIFS, then the CSI plugin is expected to resolve the secret and pass it to the runtime as the corresponding mount parameter in FileSystemMountInfo in NodePublishVolumeResponse. There is a caveat: the CRI container runtime may persist the OCI spec for the containers on disk, so this may be unsafe until the necessary plumbing happens in the runtimes to pass the OCI spec in memory.

I will call the above out in the KEP. Are there other scenarios and patterns around consumption of CSI secrets beyond the two buckets above?

Member

If we limit this proposal only to block volumes, where the VM does just mount(), then secrets should work fine.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 5, 2021
@ddebroy
Member Author

ddebroy commented Dec 7, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 7, 2021
@fengwang666

We (Databricks) have a use case that is blocked on this KEP. Without this, it's really awkward to use block devices with VM-based containers.

@ddebroy
Member Author

ddebroy commented Feb 8, 2022

Update from sig-node on Feb 8th: the CRUST APIs between the kubelet and the runtime handler will have to be plumbed through the CRI APIs. Runtime-handler-specific actions in the kubelet are not desired.

@c3d

c3d commented Feb 9, 2022

@ddebroy This is very interesting. I added an issue on the Kata side to make sure we track this.

Are you aware of any similar effort for CNI? I'm asking because Kata has very similar requirements / needs there.

@ddebroy
Member Author

ddebroy commented Feb 9, 2022

@c3d thanks! I am not aware of a similar effort for CNI. I could be wrong but my understanding is that with containerd/crio, most of the CNI interactions are handled by the CRI runtime. So that could be accomplished without Kubelet changes?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@ddebroy ddebroy force-pushed the runtime1 branch 3 times, most recently from ccade34 to 31b8bec Compare October 3, 2022 10:56
Comment on lines 398 to 402
A CSI node plugin capable of deferring file system operations should not perform
any mount operations during `NodeStageVolume` and instead perform it during
`NodePublishVolume` based on whether to defer mounts to a container runtime
handler (as detailed below). Such plugins need to implement support for CSI
API enhancements below.
Member

This is a pretty big change - most (all?) CSI drivers I met do mount volumes during NodeStage.
Is there a way to test for this in e2e tests / csi-sanity tests?

Member Author

Agree. One quick clarification about testing and enforcement: if a CSI plugin wishes to support a file system that is backed by a block device and also supports parallel FS mounts, there won't be any problems (from a functionality perspective) if it mounts the file system in the host during NodeStageVolume. Is it still recommended to enforce this with a csi-sanity or a storage e2e test targeting arbitrary storage classes?
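
To make the quoted guidance concrete, a hedged sketch of the branch a node plugin could take in NodePublishVolume. The deferral flag, `FileSystemMountInfo` shape, and mount helper are illustrative stand-ins for the fields discussed in this KEP, not an actual driver implementation.

```go
// Sketch: NodePublishVolume either performs the host mount itself (today's
// behavior) or, when deferral is requested and supported, skips mounting and
// returns the information the runtime needs to mount inside the VM.
package nodeplugin

import "fmt"

type FileSystemMountInfo struct {
	Source     string   // block device, e.g. /dev/sdb
	FSType     string   // e.g. ext4, xfs
	MountFlags []string // e.g. ["rw", "noatime"]
}

type PublishResult struct {
	// Non-nil only when the filesystem mount is being deferred to the runtime.
	DeferredMount *FileSystemMountInfo
}

func nodePublish(device, fsType, target string, deferFSOps bool, mountFn func(dev, fs, target string) error) (*PublishResult, error) {
	if deferFSOps {
		// Do not touch the host mount namespace; hand the details back instead.
		return &PublishResult{DeferredMount: &FileSystemMountInfo{
			Source: device, FSType: fsType, MountFlags: []string{"rw"},
		}}, nil
	}
	if err := mountFn(device, fsType, target); err != nil {
		return nil, fmt.Errorf("mounting %s on %s: %w", device, target, err)
	}
	return &PublishResult{}, nil
}
```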

Comment on lines +853 to +875
To avoid such corruption, ReadWriteOncePod access mode should be used for PVCs
whose associated storage class specifies deferred mounting of file system.
Member

IMO, a PV with the ReadWriteOncePod access mode should be a prerequisite for using this feature. Otherwise kubelet will fall back to the regular mounting and the CRI handler will use virtiofs. As described above, ReadWriteOnce is dangerous.

Since there is no way to enforce ReadWriteOncePod for CSI inline volumes, I would limit this feature to PVs only.

Comment on lines 1083 to 1085
When processing CSI NodePublishVolume, if a CSI plugin finds `defer_fs_ops` is
set, the CSI plugin will invoke `RuntimeStageVolume` on crust_ops which will
persist a JSON file called `mountInfo.json` (in the directory mentioned above)
Member

It would help me to understand the document if the interface between kubelet and crust_ops were described first (RuntimeStageVolume, RuntimeUnstageVolume, ...), followed by the detailed implementation of crust_ops (the content of /var/run/crust).

Member Author

Restructured a bit. Let me know if it helps readability now.
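
A sketch of the mountInfo.json persistence described in the quoted lines above; the JSON field names and the directory layout are guesses based only on this thread, not the KEP's actual schema.

```go
// Sketch of how crust_ops might persist mountInfo.json for a staged volume.
package crustops

import (
	"encoding/json"
	"os"
	"path/filepath"
)

type mountInfo struct {
	Source     string   `json:"source"`     // block device to mount, e.g. /dev/sdb
	FSType     string   `json:"fsType"`     // filesystem type, e.g. ext4
	MountFlags []string `json:"mountFlags"` // mount options
}

func persistMountInfo(volumeDir string, info mountInfo) error {
	data, err := json.MarshalIndent(info, "", "  ")
	if err != nil {
		return err
	}
	// volumeDir is the per-volume directory under /var/run/crust described above.
	return os.WriteFile(filepath.Join(volumeDir, "mountInfo.json"), data, 0o600)
}
```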

returns (RuntimeExpandVolumeResponse) {}
}

message RuntimeStageVolumeRequest {
Member

How will this API be versioned? If we add a new field in Kubernetes x.y, how will an old crust_ops / CRI handler interpret it? How will kubelet know that the CRI runtime does not understand the field?

Comment on lines 1135 to 1136
$ cat /var/run/crust/L3Nydi9...L21vdW50/runtime-cli
/bin/kata-runtime
Member

This is filled in by crust_ops from some config file, right?

Member Author

No - this is populated by the container runtime handler when handling OCI create for the containers in the pod.

Comment on lines +455 to +457
expected to interact with crust_ops using a GRPC API over mounted domain
sockets. Container runtimes are expected to implement a well defined protocol
Member

Maybe I missed it below, how will CSI plugins find the socket?
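
On the open question of locating the socket, a minimal Go sketch of a CSI plugin dialing crust_ops over a Unix domain socket with gRPC. The socket path is hypothetical and would presumably come from configuration (for example, the RuntimeClass field discussed earlier in this thread).

```go
// Sketch: connect to crust_ops over a mounted Unix domain socket.
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// The "unix://" scheme tells gRPC to dial a Unix domain socket rather than TCP.
	conn, err := grpc.Dial("unix:///var/run/crust/crust.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("connecting to crust_ops: %v", err)
	}
	defer conn.Close()
	// conn would now back a generated crust_ops client (RuntimeStageVolume, ...).
	log.Printf("connected to crust_ops at %s", conn.Target())
}
```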

Signed-off-by: Deep Debroy <ddebroy@gmail.com>
@xing-yang xing-yang added the api-review Categorizes an issue or PR as actively needing an API review. label Oct 5, 2022
@xing-yang
Contributor

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Oct 5, 2022
Member
@johnbelamaric johnbelamaric left a comment

One question but generally PRR is good.


Rolling back/downgrading a CSI plugin version with respect to its capabilities
(to defer file system mount and management operations to a container runtime
handler) will lead to certain user observable failures/errors (when resizing
Member

This is not ideal but as long as SIG storage is OK with it and it is documented here, I am OK with it from a PRR perspective. Certainly it's not an issue in alpha. It would be helpful to document for users how to identify pods and pod templates that will fail, so they may be cleaned up if necessary.

Is the described behavior only when the CSI plugin is downgraded but the feature gate is still enabled? That is, what is the behavior matrix with CSI plugin incapable/capable and feature gate enabled/disabled?

that might indicate a serious problem?
-->

If number of pods (mounting PVCs) get stuck in Initializing/ContainerCreating
Member

Are there specific, published metrics for these things? Please include their names if so.

@zvonkok

zvonkok commented Dec 14, 2022

/cc @zvonkok

@k8s-ci-robot k8s-ci-robot requested a review from zvonkok December 14, 2022 11:10
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 14, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 13, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@haosdent
Member

@ddebroy Are you still working on this?

@ddebroy
Member Author

ddebroy commented Jan 29, 2024

hi, I am not at this point - we ran into some significant challenges with a PoC of this at scale.
